Skip to content

fix: delete cni statefile when unable to be parsed #3551

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

QxBytes
Copy link
Contributor

@QxBytes QxBytes commented Apr 1, 2025

Reason for Change:

Sometimes in certain scenarios (usually windows), if there is a crash of the OS, null bytes may be written to the state and log file. When the CNI tries to restore the state, it is unable to read the statefile and fails. All subsequent retries will fail as the state is irrecoverable. This PR changes this behavior to delete the entire cni statefile if there is a syntax error (ex: if there are a bunch of null bytes in the file), as manual intervention would be needed to recover anyway. The null statefile issue only seems to appear on the pipelines on windows nodes.

Issue Fixed:

See above

Requirements:

Notes:
This issue appears sporadically

@QxBytes QxBytes self-assigned this Apr 1, 2025
@QxBytes QxBytes added fix Fixes something. ci Infra or tooling. labels Apr 1, 2025
@@ -192,13 +193,19 @@ func (nm *networkManager) restore(isRehydrationRequired bool) error {
// Read any persisted state.
err := nm.store.Read(storeKey, nm)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to log the contents invalid and all to help assist with troubleshooting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking into this and the store's file name and contents are not accessible-- currently leaning towards just adding a method to the interface to dump the contents of the store in string format.

Copy link

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Apr 18, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Apr 21, 2025
@QxBytes QxBytes force-pushed the alew/mitigate-null-statefile branch from e6da159 to a6a3b34 Compare April 25, 2025 23:41
adds Dump to store interface which returns the raw contents of the store in string format
if the store is empty or not readable it returns an error
adds to a ut to check that dump effectively prints the contents of the store
@QxBytes QxBytes force-pushed the alew/mitigate-null-statefile branch from a6a3b34 to c94573b Compare April 26, 2025 00:09
@QxBytes QxBytes marked this pull request as ready for review May 9, 2025 18:52
@QxBytes QxBytes requested a review from a team as a code owner May 9, 2025 18:52
@QxBytes QxBytes requested review from bohuini and Copilot May 9, 2025 18:52
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a failure scenario where the statefile cannot be parsed due to potential corruption (e.g., null bytes), by deleting the corrupted file to allow recovery. Key changes include:

  • Adding a Dump() method to the KeyValueStore interface and its implementations.
  • Updating unit tests to validate Dump() behavior for both empty and populated stores.
  • Adjusting the restore logic to delete the statefile when a JSON syntax error is encountered.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
testutils/store_mock.go Added Dump() method to the mock store for testing.
store/store.go Extended the KeyValueStore interface with Dump().
store/mockstore.go Implemented Dump() method for the mock store.
store/json_test.go Added tests for verifying Dump() behavior.
store/json.go Added Dump() and readBytes() methods for file handling.
network/manager.go Updated restore logic to delete corrupted state files.
Comments suppressed due to low confidence (1)

network/manager.go:213

  • Consider handling the error returned by nm.store.Remove() to ensure that removal of the corrupted state file is successful and to improve error reporting.
nm.store.Remove()

@QxBytes QxBytes requested a review from Copilot May 9, 2025 19:56
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an issue where the CNI statefile becomes corrupted (e.g., containing null bytes) causing parsing failures by deleting the irrecoverable statefile automatically. Key changes include:

  • Adding a Dump() method to the KeyValueStore interface and its implementations (jsonFileStore, mockStore, storeMock).
  • Updating unit tests to check the Dump() behavior on both empty and populated statefiles.
  • Modifying the restore logic in network manager to delete the statefile when encountering JSON syntax errors.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
testutils/store_mock.go Added Dump() stub for the KeyValueStoreMock.
store/store.go Updated KeyValueStore interface with Dump() method.
store/mockstore.go Implemented Dump() for mockStore returning formatted data.
store/json_test.go Updated tests to verify Dump() behavior for empty and populated stores.
store/json.go Refactored file reading logic; included Dump() implementation.
network/manager.go Enhanced state restoration by deleting corrupted statefiles.
Comments suppressed due to low confidence (1)

store/json_test.go:169

  • [nitpick] Consider unmarshaling both the dumped JSON and the expected JSON and comparing their data structures instead of modifying the string, which would improve test robustness and maintainability.
val = strings.ReplaceAll(val, " ","")

Comment on lines +135 to +137
_, err = kvs.Dump()
if err == nil {
t.Fatal("Expected store to not exist")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We expect the store to exist right ? why the nil check and a fatal error ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The store struct does exist, but the actual json representation on disk doesn't exist yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the store struct is created, but it's not yet created on disk

if readErr != nil {
logger.Error("Could not read corrupted state", zap.Error(readErr))
} else {
logger.Info("Logging state", zap.String("stateFile", contents))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: logging corrupted state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We log the corrupted state for debugging as per the suggestion by Michael above. Is there some drawback of doing this?

Copy link
Contributor Author

@QxBytes QxBytes May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for debugging as per the above reviewer's comments-- what other issues could arise from logging the corrupted state? It should only be dumped once as the statefile will be deleted afterwards.

Copy link

github-actions bot commented Jun 3, 2025

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jun 3, 2025
@QxBytes QxBytes removed the stale Stale due to inactivity. label Jun 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Infra or tooling. fix Fixes something.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants